Author Disambiguation: A Nonparametric Topic and Co-authorship Model

Authors

  • Andrew M. Dai
  • Amos J. Storkey
Abstract

A fully generative model is provided for the problem of author disambiguation. This approach infers the topics for each author and combines that with co-author information. The problems involved are similar to other entity resolution problems where differing references may refer to one author entity and identical references may refer to different author entities. We extend the hierarchical Dirichlet process and nonparametric latent Dirichlet allocation models to tackle this problem in a nonparametric, generative manner making no prior assumptions on the number of author entities, topics or research groups in the corpus. The model develops a hierarchical Dirichlet process for author-topic combinations. It conditions this model at document level on another hierarchical Dirichlet process for research groups. This enables the authors and topics to be suitably coupled. We perform joint inference to sample the author entities, topics and their group memberships. We present results from our approach on real-world datasets.
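The claim that the model makes no prior assumption about the number of author entities, topics or research groups rests on Dirichlet process priors. As a rough illustration only (this is not the paper's model or its inference procedure), the sketch below samples a partition from a Chinese restaurant process, the clustering scheme induced by a single Dirichlet process. The function name `crp_assignments` and the parameters `n_items`, `alpha` and `seed` are hypothetical, chosen for this example.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample a partition of n_items from a Chinese restaurant process.

    Illustrative sketch only: alpha (> 0) is the concentration parameter,
    and larger values tend to produce more clusters.
    """
    rng = random.Random(seed)
    assignments = []  # cluster index assigned to each item
    counts = []       # number of items currently in each cluster
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = crp_assignments(n_items=20, alpha=1.0)
    print("cluster of each item:", assignments)
    print("cluster sizes:", counts)
```

The number of clusters is not fixed in advance; it grows with the data as new items occasionally open new clusters. A hierarchical Dirichlet process, as named in the abstract, shares such clusters across groups, which this single-level sketch does not attempt.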

Similar resources

How Related is Author Topical Similarity to Other Author Relatedness Measures?

Using a dataset of 26,228 Psychology document surrogates from Elsevier databases, we compare author relatedness measure outcomes for 125 authors based on topic modelling to more traditional approaches that rely on direct citation, co-citation and collaboration. Outcomes for the author topical similarity measure are compared to existing co-authorships in the dataset using UCINET/NetDraw. We demo...

On co-authorship for author disambiguation

Author name disambiguation deals with clustering the same-name authors into different individuals. To attack the problem, many studies have employed a variety of disambiguation features such as coauthors, titles of papers/publications, topics of articles, emails/affiliations, etc. Among these, co-authorship is the most easily accessible and influential, since inter-person acquaintances represen...

Twitter-Network Topic Model: A Full Bayesian Treatment for Social Network and Text Modeling

Twitter data is extremely noisy – each tweet is short, unstructured and with informal language, a challenge for current topic modeling. On the other hand, tweets are accompanied by extra information such as authorship, hashtags and the user-follower network. Exploiting this additional information, we propose the Twitter-Network (TN) topic model to jointly model the text and the social network i...

Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names

With the increasing size of digital libraries it has become a challenge to identify author names correctly. The situation becomes more critical when different persons share the same name (homonym problem) or when the names of authors are presented in several different ways (synonym problem). This paper focuses on homonym names in the computer science bibliography DBLP. The goal of this study is...

Cost-effective on-demand associative author name disambiguation

Authorship disambiguation is an urgent issue that affects the quality of digital library services and for which supervised solutions have been proposed, delivering state-of-the-art effectiveness. However, particular challenges such as the prohibitive cost of labeling vast amounts of examples (there are many ambiguous authors), the huge hypothesis space (there are several features and authors fr...

Publication date: 2009